TTU-WSU: GBAD
VAST 2009 Challenge
Challenge 1: Badge and Network Traffic
Authors and Affiliations:
· Jeffrey Graves, Tennessee Tech University, jagraves21@tntech.edu
· William Eberle, Tennessee Tech University, weberle@tntech.edu [PRIMARY contact]
· Lawrence Holder, Washington State University, holder@wsu.edu
Tool(s):
In order to analyze the badge and network traffic, we used the Graph-Based Anomaly
Detection (GBAD) tool to focus the visualization on interesting structural
anomalies. Initially created in 2006 as
a joint venture between the University of Texas at Arlington and Washington
State University, GBAD discovers anomalous instances of structural patterns in
data, where the data represents entities, relationships and actions in graph
form. Input to GBAD is a labeled graph in which entities are represented by
labeled vertices and relationships or actions are represented by labeled edges
between entities. GBAD uses the minimum
description length (MDL) principle to identify the normative pattern, i.e., the
pattern that minimizes the number of bits needed to describe the input graph
after it has been compressed by that pattern. GBAD embodies novel algorithms for
identifying the three possible changes to a graph: modifications, insertions and
deletions. Each algorithm discovers those substructures that most closely match
the normative pattern without matching it exactly. As a result, GBAD looks for
activities that appear to match normal patterns but are in fact structurally
different. GBAD is a Unix-based
tool written in C, and uses the SUBDUE graph-based data mining system
(www.subdue.org) as the engine for discovering the normative pattern in a
graph. GBAD was developed by William
Eberle and Lawrence Holder.
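The MDL criterion described above can be written compactly. The notation here is an assumption made for illustration and does not appear in this submission: G is the input graph, S is a candidate substructure, DL(S) is the number of bits needed to encode S, and DL(G | S) is the number of bits needed to encode G after the instances of S have been compressed. Under that reading, the normative pattern is the substructure that minimizes the total:

```latex
S^{*} = \arg\min_{S} \bigl[ DL(S) + DL(G \mid S) \bigr]
```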
Video:
ANSWERS:
MC1.1: Identify which computer(s)
the employee most likely used to send information to his contact in a
tab-delimited table which contains for each computer identified: when the
information was sent, how much information was sent and where that information
was sent.
MC1.2: Characterize the
patterns of behavior of suspicious computer use.
In order to analyze the badge and network traffic of employees, we used our
Graph-Based Anomaly Detection (GBAD) system. GBAD takes a graph representation
of data and applies three algorithms that analyze the graph for structural
anomalies. Each of these algorithms is applied after the normative graph
structure has been discovered.
It is our hypothesis that such a system can discover knowledge in a
graph representation of the badge and network traffic data that will (1) show
the normal structure of the employee movements and network activity, and (2) show
anomalies in employee behavior, indicating a possible insider threat.
In order to answer the challenge,
we decided to focus on the movements and locations of the employees, along with
their connections to the network. Based
upon all of the information that was provided with the challenge, we made the
following assumptions about this particular data set:
Starting with these simple assumptions, we created graphs based upon the movement of
employees between areas (outside, building, classified) and the number of
connections that were made by the employee each time they were in the building,
where vertices represented locations and network connections, and edges
indicated order of movements.
This process of creating graphs is performed manually, as the choice of an
appropriate graph topology is domain dependent. For this mini-challenge, our
graphs consisted of subgraphs that represented employee movements for a
particular day. Each subgraph contained a backbone of movement vertices.
Attached to each movement vertex were two vertices representing where the person
started and ended (i.e., outside, building, classified), connected by edges
labeled start and end. If network traffic was sent before the person moved
again, a network vertex was created and linked to the movement vertex via a
sends edge. The network vertex was also linked to a vertex with a numerical
label, representing how many messages were sent before the next movement
occurred. Also attached to each movement vertex via a time edge was a vertex
representing the time reported in the proximity log (e.g., early_morning
0:00-7:59, morning 8:00-11:59, after_noon 12:00-16:59, evening 17:00-20:59,
night 21:00-23:59). A numerical vertex representing the hour was also connected
to the time vertex via an hour edge.
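The time-of-day labels listed above amount to a simple binning of the hour. A minimal sketch of that binning follows; the function name time_of_day is hypothetical and this is not the authors' script, only the bin boundaries come from the text.

```python
def time_of_day(hour):
    """Map an hour (0-23) to the time-of-day label used for the time vertex.

    Bin boundaries are taken from the text; the function name is hypothetical.
    """
    if not 0 <= hour <= 23:
        raise ValueError("hour must be in the range 0-23")
    if hour < 8:
        return "early_morning"  # 0:00-7:59
    if hour < 12:
        return "morning"        # 8:00-11:59
    if hour < 17:
        return "after_noon"     # 12:00-16:59
    if hour < 21:
        return "evening"        # 17:00-20:59
    return "night"              # 21:00-23:59
```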
Figure 1. Example subgraph.
In the example shown in Figure 1, a person entered the building in the
early_morning between 7AM and 8AM. The person sent 2 network messages and then
moved into the classified area in the morning between 8AM and 9AM. The person
then left the classified area in the morning between 9AM and 10AM.
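To make the topology concrete, the following sketch builds the Figure 1 example as in-memory vertex and edge lists. It is an illustration under stated assumptions, not the authors' code: the function name, the dictionary keys, the "movement" and "network" vertex labels, and the "next" and "count" edge labels are hypothetical, since the text names only the start, end, sends, time and hour edges.

```python
def build_day_subgraph(movements):
    """Build (vertex_labels, edges) for one day of movements.

    Vertex ids are 1-based positions in vertex_labels.
    Edges are (source_id, target_id, label) tuples.
    """
    vertex_labels = []
    edges = []

    def add_vertex(label):
        vertex_labels.append(str(label))
        return len(vertex_labels)

    previous = None
    for move in movements:
        m = add_vertex("movement")
        if previous is not None:
            # Backbone link between consecutive movements (label is assumed).
            edges.append((previous, m, "next"))
        edges.append((m, add_vertex(move["start"]), "start"))
        edges.append((m, add_vertex(move["end"]), "end"))
        t = add_vertex(move["period"])
        edges.append((m, t, "time"))
        edges.append((t, add_vertex(move["hour"]), "hour"))
        if move.get("messages", 0) > 0:
            n = add_vertex("network")
            edges.append((m, n, "sends"))
            # Edge label to the message count is assumed.
            edges.append((n, add_vertex(move["messages"]), "count"))
        previous = m
    return vertex_labels, edges


# The Figure 1 example as described in the text.
figure_1 = [
    {"start": "outside", "end": "building", "period": "early_morning",
     "hour": 7, "messages": 2},
    {"start": "building", "end": "classified", "period": "morning", "hour": 8},
    {"start": "classified", "end": "building", "period": "morning", "hour": 9},
]
vertex_labels, edges = build_day_subgraph(figure_1)
```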
A graph input file for the GBAD system is an ASCII text file that defines the
vertices with sequential numbers, and defines edges using these numbers to
specify a connection between two vertices. Using a Python script that converts
the proxLog.csv and IPLog3.5.csv files provided with the mini-challenge, we
generated graph input files that matched the topology described above. An
example (partial) graph input file, created using this method, looks like the
following:
v 1 location
v 2 classified
e 1 2 location_type
…
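As an illustration of the file layout shown above, the following sketch serializes a list of vertex labels and a list of edges into "v" and "e" lines. The function name to_graph_lines is hypothetical and this is not the authors' conversion script; it only reproduces the three lines shown in the example.

```python
def to_graph_lines(vertex_labels, edges):
    """Return graph input lines: one 'v' line per vertex, one 'e' line per edge.

    vertex_labels: labels in order; vertex ids are assigned sequentially from 1.
    edges: (source_id, target_id, label) tuples that refer to those ids.
    """
    count = len(vertex_labels)
    lines = [f"v {i} {label}" for i, label in enumerate(vertex_labels, start=1)]
    for source, target, label in edges:
        if not (1 <= source <= count and 1 <= target <= count):
            raise ValueError(f"edge ({source}, {target}) refers to an unknown vertex")
        lines.append(f"e {source} {target} {label}")
    return lines


example_lines = to_graph_lines(
    ["location", "classified"],
    [(1, 2, "location_type")],
)
print("\n".join(example_lines))
```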
GBAD is a
command-line program that can run on multiple operating systems (Linux,
Windows, etc.). Once the graph files are
created, GBAD is executed on each graph input file, returning (1) the normative
pattern discovered in the specified graph input file, and (2) the top-N most
anomalous patterns, where N is set to 1 by default. The graph input file and
discovered patterns can be converted to the dot format and visualized in GraphViz.
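The conversion to dot can be sketched in a few lines. This is a minimal illustration, not the converter used for this submission: the function name is hypothetical, edges are drawn as directed by assumption, and any line that does not start with "v" or "e" is skipped.

```python
def graph_lines_to_dot(lines, name="G"):
    """Convert 'v id label' / 'e src dst label' lines into DOT text."""
    out = [f"digraph {name} {{"]
    for line in lines:
        parts = line.split()
        if len(parts) >= 3 and parts[0] == "v":
            label = " ".join(parts[2:])
            out.append(f'  {parts[1]} [label="{label}"];')
        elif len(parts) >= 4 and parts[0] == "e":
            label = " ".join(parts[3:])
            out.append(f'  {parts[1]} -> {parts[2]} [label="{label}"];')
        # Anything else (blank lines, comments, elisions) is ignored.
    out.append("}")
    return "\n".join(out)


dot_text = graph_lines_to_dot(
    ["v 1 location", "v 2 classified", "e 1 2 location_type"]
)
print(dot_text)
```

The resulting text can be saved to a file and rendered with the GraphViz dot command.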
We initially created one graph of all employee activity for all
days. From that graph, we were able to
discover the normative pattern for all employees across all days. Figure 2 shows a visualization of the
normative pattern.
Figure 2. Normative pattern.
After uncovering the normative pattern, GBAD then uses three algorithms to
discover all of the possible structural changes that can exist in a graph (i.e.,
modifications, insertions, and deletions). Both the discovery of the normative
pattern and the detection of the anomalies are done automatically with a single
run of GBAD.
In order to determine which employee was the insider threat, we manually ranked
our observations based upon which employees were involved in anomalies of the
following types:
· Piggybacking
· Movement
· Network activity
· Time of day
Based upon these criteria, we suspected that employee number 38 was involved
because of patterns of behavior such as:
· Exiting the classified area with no record of entry (i.e., piggybacking).
· Found in the building sending network traffic with no record of how they got in.
· Weekend activity.
· Large number of network connections.
· Activity at unusual times of the day.
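The first pattern above, an exit from the classified area with no matching entry, can also be checked directly against a badge sequence. The sketch below is only an illustration of that consistency check and is not part of GBAD or of the authors' analysis. The event names are hypothetical, and it assumes that leaving the building is not logged, so a second building entry in the same day is accepted.

```python
# Hypothetical event names; the real proximity log may use different ones.
# Each event maps to (areas the person may be in beforehand, area afterwards).
TRANSITIONS = {
    "enter_building": ({"outside", "building"}, "building"),
    "enter_classified": ({"building"}, "classified"),
    "exit_classified": ({"classified"}, "building"),
}


def find_badge_gaps(events, start_area="outside"):
    """Return (index, event, area) for each event that is inconsistent with
    the area the person was last recorded in."""
    area = start_area
    gaps = []
    for index, event in enumerate(events):
        allowed_before, area_after = TRANSITIONS[event]
        if area not in allowed_before:
            gaps.append((index, event, area))
        area = area_after  # resynchronize on the logged event
    return gaps


# An exit from the classified area that was never preceded by an entry.
suspicious_day = ["enter_building", "exit_classified"]
print(find_badge_gaps(suspicious_day))
```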
Figure 3 shows an example of one of the anomalous instances reported by GBAD for
this employee.
Figure 3. Example of unusual movement by employee at an abnormal time for that
employee.
Some other interesting observations we made about other employee behavior were:
· Employee 12 likes working late, sometimes close to midnight.
· Employee 8 exits the building in the middle of the day, after making a network connection, and returns later in the day.
· Employee 26 moves around the facility significantly more than other employees.
GBAD can be used to detect anomalies indicative of possible insider threat
activity in a graph representation of data that captures relational
information. While we use GraphViz to visualize the graph patterns, any graph
visualization tool could be used. The main point is that the ability to discover
normative patterns and anomalies is critical to the visual detection of insider
threat activity in the data.